Recently, due to the increasing requirements of medical imaging applications and the professional expertise required to annotate medical images, few-shot learning has gained increasing attention in medical image semantic segmentation. To perform segmentation with a limited number of labeled medical images, most existing studies use Prototypical Networks (PN) and have achieved compelling success. However, these approaches overlook the query image features extracted from the proposed representation network, failing to preserve the spatial connection between query and support images. In this paper, we propose a novel self-supervised few-shot medical image segmentation network and introduce a novel Cycle-Resemblance Attention (CRA) module to fully leverage the pixel-wise relations between query and support medical images. Notably, we first line up multiple attention blocks to refine richer relation information. Then, we present CRAPNet by integrating the CRA module with a classic prototype network, where pixel-wise relations between query and support features are well recaptured for segmentation. Extensive experiments on two medical image datasets, i.e., abdominal MRI and abdominal CT, demonstrate the superiority of our model over existing state-of-the-art methods.
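A rough idea of how pixel-wise cycle attention between query and support feature maps might look in PyTorch is sketched below; the stacked-block structure, residual form, and temperature are illustrative assumptions, not the paper's exact CRA design.

```python
import torch
import torch.nn.functional as F

def resemblance_block(feat_q, feat_s, tau=0.1):
    """One pixel-wise attention block: query pixels read from support pixels.
    feat_q: (B, C, Hq, Wq), feat_s: (B, C, Hs, Ws)."""
    B, C, Hq, Wq = feat_q.shape
    q = F.normalize(feat_q.flatten(2), dim=1)                  # (B, C, Nq)
    s = F.normalize(feat_s.flatten(2), dim=1)                  # (B, C, Ns)
    aff = torch.einsum('bcn,bcm->bnm', q, s) / tau             # (B, Nq, Ns) affinity
    read = torch.einsum('bnm,bcm->bcn', aff.softmax(-1), feat_s.flatten(2))
    return (feat_q.flatten(2) + read).reshape(B, C, Hq, Wq)    # residual refinement

def cycle_resemblance(feat_q, feat_s, num_blocks=2):
    """Stack blocks in both directions so query<->support pixel correspondence is
    preserved before prototype-based segmentation (an assumed simplification)."""
    for _ in range(num_blocks):
        feat_q = resemblance_block(feat_q, feat_s)   # query refined by support
        feat_s = resemblance_block(feat_s, feat_q)   # support refined by query (cycle)
    return feat_q, feat_s
```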
In this work, we propose a semantic flow-guided two-stage framework for shape-aware face swapping, namely FlowFace. Unlike most previous methods that focus on transferring the source inner facial features but neglect facial contours, our FlowFace can transfer both of them to a target face, thus leading to more realistic face swapping. Concretely, our FlowFace consists of a face reshaping network and a face swapping network. The face reshaping network addresses the shape outline differences between the source and target faces. It first estimates a semantic flow (i.e., face shape differences) between the source and the target face, and then explicitly warps the target face shape with the estimated semantic flow. After reshaping, the face swapping network generates inner facial features that exhibit the identity of the source face. We employ a pre-trained face masked autoencoder (MAE) to extract facial features from both the source face and the target face. In contrast to previous methods that use identity embedding to preserve identity information, the features extracted by our encoder can better capture facial appearances and identity information. Then, we develop a cross-attention fusion module to adaptively fuse inner facial features from the source face with the target facial attributes, thus leading to better identity preservation. Extensive quantitative and qualitative experiments on in-the-wild faces demonstrate that our FlowFace outperforms the state-of-the-art significantly.
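A minimal sketch of what such a cross-attention fusion step could look like is given below, assuming patch-token features from the MAE encoder; the module name, dimensions, and residual layout are hypothetical, not FlowFace's exact implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Target-face tokens query source-face tokens so that source identity is
    injected while target attributes are preserved (illustrative sketch)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_tokens, source_tokens):
        # target_tokens, source_tokens: (B, N, dim) patch features from the face encoder
        fused, _ = self.attn(query=target_tokens, key=source_tokens, value=source_tokens)
        return self.norm(target_tokens + fused)   # residual keeps target attributes
```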
Purpose: Vision-based robot tool segmentation plays a fundamental role in surgical robots and downstream tasks. CaRTS, based on a complementary causal model, has shown promising performance in unseen counterfactual surgical environments in the presence of smoke, blood, etc. However, due to limited observability, CaRTS requires over 30 iterations of optimization to converge for a single image. Method: To address this limitation, we take temporal relations into consideration and propose a temporal causal model for robot tool segmentation on video sequences. We design an architecture named Temporally Constrained CaRTS (TC-CaRTS), which adds three novel modules to CaRTS: a temporal optimization pipeline, a kinematics correction network, and spatial-temporal regularization. Results: Experiments show that TC-CaRTS requires far fewer iterations to achieve the same or better performance than CaRTS, and matches or exceeds CaRTS across different domains. All three modules are shown to be effective. Conclusion: We propose TC-CaRTS, which exploits temporal constraints as additional observability. We show that TC-CaRTS outperforms prior work on the robot tool segmentation task with improved convergence speed on test datasets from different domains.
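One way temporal constraints can supply extra observability is to warm-start each frame's optimization from the previous frame and penalize large frame-to-frame changes; the toy regularizer below only illustrates that idea and is not the authors' exact formulation.

```python
import torch

def temporal_smoothness_loss(pose_prev, pose_curr, weight=1.0):
    """Toy spatial-temporal regularizer: penalize large frame-to-frame changes in
    the estimated tool pose/kinematics, added on top of the per-frame objective.
    (Illustrative assumption, not TC-CaRTS's actual modules.)"""
    return weight * torch.mean((pose_curr - pose_prev) ** 2)

# Warm-starting: initializing pose_curr from pose_prev instead of from scratch is
# one simple mechanism for cutting the number of per-frame optimization iterations.
```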
Given data with noisy labels (i.e., incorrect data), deep neural networks gradually memorize the label noise, which harms model performance. To mitigate this problem, curriculum learning has been proposed to improve model performance and generalization by ordering training samples in a meaningful (e.g., easy-to-hard) sequence. Previous work treats incorrect samples as generic hard samples without distinguishing between hard samples (i.e., hard samples in correct data) and incorrect samples. Indeed, a model should learn from hard samples to promote generalization rather than overfit to errors. In this paper, we address this problem by appending a novel loss function, IndimLoss, to the existing task loss. Its main effect is to automatically and stably estimate the importance of easy samples and difficult samples (including both hard and incorrect samples) in the early stage of training to improve model performance. Then, in the following stages, IndimLoss is dedicated to distinguishing hard samples from incorrect ones to improve model generalization. This training strategy can be formulated dynamically in a self-supervised manner, effectively mimicking the main principle of curriculum learning. Experiments on image classification, image regression, text sequence regression, and event relation reasoning demonstrate the versatility and effectiveness of our method, especially in the presence of diverse noise levels.
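A toy version of this kind of sample weighting on top of an existing task loss is sketched below: early epochs weight all samples nearly equally, and later epochs down-weight samples whose loss stays far above the batch average (likely mislabeled) while keeping moderately hard ones. The schedule and weighting rule are assumptions made for illustration, not the paper's actual loss.

```python
import torch

def weighted_task_loss(per_sample_loss, epoch, warmup_epochs=5, margin=1.0):
    """Easy/hard/incorrect-aware weighting sketch (hypothetical, not IndimLoss itself)."""
    if epoch < warmup_epochs:
        weights = torch.ones_like(per_sample_loss)        # treat all samples as important
    else:
        gap = per_sample_loss - per_sample_loss.mean()
        weights = torch.sigmoid(margin - gap)             # very high loss -> weight near 0
    return (weights.detach() * per_sample_loss).mean()
```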
Multimodal learning, especially large-scale multimodal pre-training, has developed rapidly over the past few years and has brought about some of the greatest advances in artificial intelligence (AI). Despite its effectiveness, understanding the underlying mechanisms of multimodal pre-trained models remains a great challenge. Revealing the explainability of such models could enable breakthroughs in novel learning paradigms in the AI field. To this end, given the multimodal nature of the human brain, we propose to explore the explainability of multimodal learning models with the aid of non-invasive brain imaging technologies such as functional magnetic resonance imaging (fMRI). Concretely, we first present a newly designed multimodal foundation model pre-trained on 15 million image-text pairs, which shows strong multimodal understanding and generalization abilities on various cognitive downstream tasks. Furthermore, from the perspective of neural encoding (based on our foundation model), we find that both the visual and language encoders trained multimodally are more brain-like than their unimodal counterparts. In particular, we identify a number of brain regions in which multimodally trained encoders exhibit better neural encoding performance. This is consistent with the findings of existing studies on multisensory integration in the brain. We therefore believe that multimodal foundation models are a more suitable tool for neuroscientists to study the multimodal signal processing mechanisms of the human brain. Our findings also demonstrate the potential of multimodal foundation models as ideal computational simulators for promoting research on both the brain and brain-inspired AI.
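Neural encoding analyses of this kind typically fit a linear map from model features of the stimuli to voxel-wise fMRI responses and score it on held-out stimuli; the snippet below is only a generic sketch with placeholder shapes and a ridge penalty, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder data: model features of 1000 stimuli and responses of 500 voxels.
features = np.random.randn(1000, 768)   # (stimuli, model feature dim)
voxels = np.random.randn(1000, 500)     # (stimuli, fMRI voxels)

X_tr, X_te, y_tr, y_te = train_test_split(features, voxels, test_size=0.2, random_state=0)
encoding_model = Ridge(alpha=10.0).fit(X_tr, y_tr)
# Higher held-out prediction accuracy for a given encoder suggests more "brain-like" features.
print("held-out encoding R^2:", encoding_model.score(X_te, y_te))
```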
The necessity of cooperation among intelligent machines has popularized cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, much of the research effort has focused on developing practical MARL algorithms whose effectiveness is studied only empirically, leaving them without theoretical guarantees. As recent studies have shown, MARL methods often achieve performance that lacks reward monotonicity or converges only to suboptimal solutions. To address these issues, in this paper we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which provides a general template for MARL algorithm design. We prove that algorithms derived from the HAML template satisfy the desired properties of monotonic improvement of the joint reward and convergence to a Nash equilibrium. We verify the practicality of HAML by showing that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraftII and multi-agent MuJoCo tasks.
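Schematically, and with notation heavily simplified (this is not the paper's exact statement), a mirror-learning-style update of this kind has each agent $i$ in turn solve a surrogate problem of the form

$$\pi_i^{k+1} \in \arg\max_{\pi_i}\ \mathbb{E}\!\left[A^{\,i}_{\boldsymbol{\pi}^{k}}(s,\mathbf{a})\right] \;-\; \nu\,\mathfrak{D}\!\left(\pi_i \,\Vert\, \pi_i^{k}\right),$$

where the first term is an advantage-based improvement objective and the second is a "drift" penalty keeping the new policy close to the old one; the particular choices of advantage estimator and drift are what distinguish concrete instances such as HATRPO and HAPPO.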
Intelligent dialogue systems that aim to communicate with humans harmoniously in natural language are of great significance for advancing human-machine interaction in the era of artificial intelligence. With increasingly complex human-computer interaction requirements (e.g., multimodal inputs, time sensitivity), it is difficult for traditional text-based dialogue systems to meet the demand for more vivid and convenient interaction. Consequently, visual-context-augmented dialogue systems (VAD), which can communicate with humans by perceiving and understanding multimodal information (i.e., the visual context in images or videos together with the textual dialogue history), have become a dominant research paradigm. Benefiting from the consistency and complementarity between visual and textual contexts, VAD has the potential to generate engaging and context-aware responses. To trace the development of VAD, we first characterize its concept and unique features, and then present its generic system architecture to illustrate the system workflow. Subsequently, several research challenges and representative works are studied in detail, followed by a summary of authoritative benchmarks. We conclude this paper by raising some open problems and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue contexts, and knowledge-enhanced cross-modal semantic interaction.
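As a very rough illustration of the generic workflow described above (encode the dialogue history and the visual context, fuse them, decode a response), a toy pipeline might look as follows; every module choice and size here is a placeholder and does not correspond to any particular system surveyed.

```python
import torch
import torch.nn as nn

class ToyVAD(nn.Module):
    """Simplified visual-context-augmented dialogue pipeline (illustration only)."""
    def __init__(self, vocab=30000, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=2)
        self.vis_proj = nn.Linear(2048, dim)                    # e.g., pooled region features
        self.fuse = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.decoder = nn.Linear(dim, vocab)                    # stand-in for a real decoder

    def forward(self, history_ids, visual_feats):
        h = self.text_enc(self.text_emb(history_ids))           # (B, T, dim) dialogue history
        v = self.vis_proj(visual_feats)                         # (B, R, dim) visual context
        fused, _ = self.fuse(query=h, key=v, value=v)           # text attends to vision
        return self.decoder(h + fused)                          # next-token logits
```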
Graph outlier detection is an emerging but crucial machine learning task with numerous applications. Despite the proliferation of algorithms in recent years, the lack of a standard and unified setting for performance evaluation limits their progress and use in real-world applications. To bridge this gap, we present, to the best of our knowledge, the first comprehensive benchmark for unsupervised node outlier detection, UNOD, with the following highlights: (1) evaluating 14 methods whose backbones range from classic matrix factorization to the latest graph neural networks; (2) benchmarking method performance on real-world datasets with different types of injected outliers as well as natural outliers; (3) comparing the efficiency and scalability of the algorithms by runtime and GPU memory usage on synthetic graphs of varying scales. Based on an analysis of extensive experimental results, we discuss the pros and cons of current methods and point out several crucial and promising directions for future research.
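The core of such an evaluation is scoring each node with a detector and comparing the scores against ground-truth outlier labels; the loop below only illustrates that pattern with a hypothetical detector mapping and random data, and is not the benchmark's own API.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

labels = np.random.randint(0, 2, size=1000)                  # 1 = outlier node (placeholder)
detectors = {"random_baseline": lambda: np.random.rand(1000)}  # name -> node-score function

for name, score_fn in detectors.items():
    scores = score_fn()                                      # higher score = more anomalous
    print(name,
          "ROC-AUC %.3f" % roc_auc_score(labels, scores),
          "AP %.3f" % average_precision_score(labels, scores))
```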
Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models are not good at capturing local details and lack prior knowledge of human body configurations, which limits their modeling capacity for skeleton representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-like architecture named GraphMLP, which combines MLPs and graph convolutional networks (GCNs) in a unified global-local-graphical architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of the human body into the MLP model to meet the domain-specific requirements while allowing both local and global spatial interactions. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Our source code and pretrained models will be made publicly available.
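One way to mix graph convolution with an MLP block over a skeleton is sketched below: a GCN-style step propagates joint features along body-graph edges (local structure) and a channel MLP then mixes features globally. The block layout, sizes, and normalization are assumptions for illustration, not the GraphMLP architecture itself.

```python
import torch
import torch.nn as nn

class ToyGraphMLPBlock(nn.Module):
    """Illustrative graph-reinforced MLP block for joint features (B, J, dim)."""
    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        # Replace the identity with the true skeleton adjacency (plus self-loops).
        self.register_buffer("adj", torch.eye(num_joints))
        self.gcn_weight = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x):
        deg = self.adj.sum(-1, keepdim=True).clamp(min=1)
        x = x + self.gcn_weight(torch.matmul(self.adj / deg, x))  # normalized neighbor aggregation
        return x + self.mlp(x)                                    # channel MLP with residual
```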
Context-aware decision support in the operating room can foster surgical safety and efficiency by leveraging real-time feedback from surgical workflow analysis. Most existing works recognize surgical activities at a coarse-grained level, such as phases, steps or events, leaving out the fine-grained interaction details of the surgical activity; yet these are needed for more helpful AI assistance in the operating room. Recognizing surgical actions as triplets of <instrument, verb, target> combinations delivers comprehensive details about the activities taking place in surgical videos. This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos. The challenge granted private access to the large-scale CholecT50 dataset, which is annotated with action triplet information. In this paper, we present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge. A total of 4 baseline methods from the challenge organizers and 19 new deep learning algorithms from competing teams are presented, recognizing surgical action triplets directly from surgical videos and achieving mean average precision (mAP) ranging from 4.2% to 38.1%. This study also analyzes the significance of the results obtained by the presented approaches, performs a thorough methodological comparison between them with an in-depth result analysis, and proposes a novel ensemble method for enhanced recognition. Our analysis shows that surgical workflow analysis is not yet solved, and highlights interesting directions for future research on fine-grained surgical activity recognition, which is of utmost importance for the development of AI in surgery.
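The mAP figures quoted here come from treating each <instrument, verb, target> combination as one label in a multi-label problem and averaging the per-class average precision; the snippet below only illustrates that metric on placeholder arrays and is not the challenge's official evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

num_frames, num_triplet_classes = 200, 100                    # placeholder sizes
y_true = np.random.randint(0, 2, size=(num_frames, num_triplet_classes))   # frame-level triplet labels
y_score = np.random.rand(num_frames, num_triplet_classes)                  # model confidences

# Average precision per triplet class (skipping classes with no positives), then the mean.
per_class_ap = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(num_triplet_classes) if y_true[:, c].any()]
print("triplet mAP: %.3f" % np.mean(per_class_ap))
```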